Notes on correctness of p-values when analyzing experiments using SAS and R

It is commonly believed that if a two-way analysis of variance (ANOVA) is carried out in R, then reported p-values are correct. This article shows that this is not always the case. Results can vary from non-significant to highly significant, depending on the choice of options. The user must know exactly which options result in correct p-values, and which options do not. Furthermore, it is commonly supposed that analyses in SAS and R of simple balanced experiments using mixed-effects models result in correct p-values. However, the simulation study of the current article indicates that frequency of Type I error deviates from the nominal value. The objective of this article is to compare SAS and R with respect to correctness of results when analyzing small experiments. It is concluded that modern functions and procedures for analysis of mixed-effects models are sometimes not as reliable as traditional ANOVA based on simple computations of sums of squares.


Introduction
In this digital age, researchers are presented with an unprecedented opportunity to analyze a wide range of data quickly and efficiently using statistical software.While these software aim to employ the same methods for data analysis, the results they produce can vary broadly, which in turn leads to different conclusions being drawn.Sometimes, conclusions may even be incorrect.Hence comes the importance of the statistical method for data analysis.
Within the various software, an array of options are available to analyze linear fixedeffects and mixed-effects models.Here, the question posed is as to which of the results obtained are more accurate.We examine functions in R [1] and procedures of SAS [2] for linear fixed-effects and mixed-effects models.For the fixed-effects case, we discuss unbalanced two-factorial experiments and type 3 tests, while for the mixed-effects case, we examine balanced randomized complete block experiments and split-plot experiments with focus on Type I error.
In fixed-effects models, there are exact F-tests, calculated from sums of squares in classical analysis of variance table [3].There are different types of sums of squares commonly known as types 1, 2 and 3.For balanced datasets, results are identical, whereas for unbalanced datasets the different types are associated with different null hypotheses [4].There have been and still are many different viewpoints regarding the types of sums of squares [5,6].The following are recurring arguments: Type 1 sums of squares are useful when cell frequencies are good estimates of population proportions [7]; Type 2 sums of squares have better power than type 3 sums of squares when interaction is negligible [8]; and type 3 sums of squares are appropriate when observations are missing at random, as their null hypotheses are not dependent on the cell frequencies [4,9].The choice of type depends on the question the researcher wants to answer [10].
Nowadays, the linear mixed-effects model is used extensively and has in practice overrun the classical analysis of variance [11].The linear mixed-effects models are advantageous when applied in an array of sophisticated problems.The F-tests obtained using linear mixed-effects models are, however, in general approximate.The F-test statistics are no longer ratios of simple mean squared errors from analysis of variance, but are computed using linear functions of parameter estimates.These statistics are also functions of variance components associated with random effects.Variance components can be estimated by equating the observed mean squares to their expected values in a simple analysis of variance setting, or by using likelihoodbased methods such as the restricted maximum likelihood (REML) method [12,13].Although the computed variance components can be negative [11,14], they can still be interpreted in a legitimate manner [15,16].In practice, the estimates of the variance components are constrained to be non-negative, which may affect the assumption on the appropriate number of degrees of freedom in the denominator of the F-distribution.Different methods for computing the denominator degrees of freedom are available [17].The Kenward and Roger method [18,19] is recommended [20][21][22][23].However, this method is implemented differently in SAS and R with regard to how the number of degrees of freedom is calculated when estimates of the variance components are zero.This poses a problem at hypothesis testing of fixed effects, because different software give different results.
Bearing in mind the many limitations associated with traditional analysis of variance, the linear mixed-effects model has gained a broader application.It is however necessary to carry out the analysis carefully, even in simple cases.In this study, we investigate the implications of the available options in the different software with regard to tests of fixed effects in linear fixed-effects and mixed-effects models.
As regards fixed-effects models, we are particularly interested in the type 3 tests provided by the Anova function of the car package [24] and the ezANOVA function of the ez package [25] in R. It has been noted that the Anova function 'does not always give the same results as SAS does.' [5] From a practical point of view, this important observation deserves more attention.We will present an example that displays differences in p-values computed in SAS and R, depending on the procedure or function used.Through this example, we shall clarify what null hypotheses are actually tested.
Regarding mixed-effects models, we investigate how R and SAS handle non-positive estimates of variance components and what consequences it has for the p-values.In our experience, non-positive estimates are common.We will draw attention to the fact that statistical packages handle this situation differently and discuss the alternatives.
The next section, Background and methods, recalls the theory and specify models.The Examples section provides illustrating examples.The unbalanced two-factorial experiment uses a fixed-effects model and notes differences between R and SAS with regard to results of type 3 tests.The randomized complete block and split-plot experiments use mixed-effects models with non-positive estimates of variance components and observes differences between R and SAS regardless of the type of test.The Simulation studies section investigates mixedeffects models further.In this section, frequency of Type I error is estimated through simulation.The Practical advice section gives useful recommendations.The article ends with a Discussion.

General model
A general fixed-effects linear model can be written as where Y is an n × 1 vector of observations, X is a known n × m design matrix, β is an m × 1 vector of unknown fixed-effects parameters, e is an n × 1 vector of normally distributed random effects assumed to have mean 0 and variance-covariance matrix s 2 e I n , where I n is an n × n identity matrix.If X is of full rank, then the best linear unbiased estimator is obtained using the least-squares solution b ¼ ðX 0 XÞ À 1 X 0 Y. Otherwise, if X is not of full rank, i.e., if some columns are linearly dependent, the model is said to be overparameterized and the inverse of X 0 X cannot be computed.Instead a generalized inverse, (X 0 X) − , can be used, and the solution will be b ¼ ðX 0 XÞ À X 0 Y [3].This solution is not unique, because it depends on the choice of the generalized inverse.In practice, the sweep operator may be used for the computation [26].Using this method, the least-squares solution depends on the ordering of the columns of the design matrix X.However, there are some linear functions of the parameters, the so-called estimable functions, that have unique solutions.
A linear hypothesis can in general be expressed as where L is a vector or matrix that is a linear combination of the rows of X, which makes Lβ estimable even if the model is overparametrized.This hypothesis (2) can be tested using the F-statistic where Q A ¼ ðL bÞ 0 ðLðX 0 XÞ À L 0 Þ À 1 ðL bÞ, k = rank(L), and ŝ2 e is the estimated error variance, i.e., the mean square error.Under the null hypothesis, F is F-distributed with k and n − rank(X) degrees of freedom.
Another way to deal with the problem of an overparametrized model is through putting constraints on the parameters.Usually this is done by setting the parameters corresponding to the first or last levels of the factors to zero, or just constrain the parameters of each factor in such a way that they sum up to zero, which is called sigma-restriction [4].
Adding random effects to model (1) gives the so-called mixed-effects model: where Z is an n × q incidence matrix of known elements, and u is a q × 1 vector of unknown random effects.It is assumed that u and e are independently distributed, and u * N(0, G), where G is a block-diagonal variance-covariance matrix.Hence, Y * N(Xβ, V), where When estimates are substituted for the variance components of V, this matrix is denoted V .The hypotheses (2) can be tested using the test statistic is the estimate of the fixed effects, β, and Ĉ ¼ ðX 0 V À 1 XÞ À is the estimated variance-covariance matrix of b.This F-statistic is approximately F-distributed with k ¼ rankðL ĈL 0 Þ and û degrees of freedom, where û must be estimated.In the software we consider, there are a number of available methods for computing υ, among them the Kenward and Roger method [19,20].

A two-way fixed-effects model for unbalanced data
In the context of analysis of variance, we consider a special case of model ( 1), a two-way fixedeffects model with interaction: There are three commonly used computing procedures with regard to sums of squares and tests, known as types 1,2 and 3. When data is balanced they are all equal; however, in unbalanced data they are usually not.The type 1 test is a forward sequential procedure.First, the sum of squares for one factor, say A, is computed without the other factor, B, in the model.Next, the sum of squares for B is computed, given that A is already in the model.With this type of test, the result depends on which of the two factors is specified first.For types 2 and 3, the order does not matter.Using type 2, the sum of squares for A is computed given that factor B is already in the model, whereas the sum of squares for B is computed given that A is already in the model.The type 3 test computes sums of squares for A that are adjusted for effects of B and orthogonal to effects of A × B interaction [6].Similarly, the sums of squares for B are adjusted for A and orthogonal to A × B. The type 3 test was defined by SAS [2,6] and is the one that tests the null hypothesis of no differences between the unweighted marginal means.
Specifically, the three types of sums of squares test the following null hypotheses [9,27].Provided that A is specified before B at the analysis, the type 1 test of A tests the null hypothesis where n i. is the total number of observations for the i-th level of factor A. Each cell mean μ ij is weighted with respect to the number of observations in the ij-th cell.The type 2 test of A tests the null hypothesis where n .j is the total number of observations for the j-th level of factor B. The type 3 test of A tests the null hypothesis where m i : ¼ P b j¼1 m ij =b is the unweighted marginal mean of the i-th treatment, i.e., all the cells are weighted equally, regardless of their cell frequencies.
The three types of tests can be carried out through computations on reductions of sums of squares [4].For the type 3 test, this method requires sigma-restricted parametrization.However, the type 3 test coincides with Yates's method of weighted squares of means [28], which does not require sigma-restriction [4,6].Commonly, software neither uses the reduction-ofsums-of-squares method nor the method of weighted squares of means for computations of type 1-3 tests.Instead the general test statistic (3) is used, where L may be specified, depending on the type of test, using the Forward-Doolittle operator [26].

Three mixed-effects models for balanced data
We consider three balanced mixed-effects models (4); the first is a model for analysis of randomized complete block experiments with random effects of blocks: where y ij is the response of the i-th level of the fixed effect A and the j-th level of the random effect B. The intercept is denoted as μ, the fixed effect of treatment as α i , the random effect of block as b j � Nð0; s 2 B Þ, and the residual error effect as e ij � Nð0; The expected values can be expressed as μ i = μ + α i .Under the assumption of model (10), the null hypothesis of no difference between the fixed treatment effects, H 0 : μ 1 = μ 2 = . . .= μ a , can be performed using the F-statistic This test statistic is exactly F-distributed with a − 1 and (a − 1)(b − 1) degrees of freedom.Note that ( 11) is a function of sums of squares and degrees of freedom only.The test is therefore independent of which method is used to estimate the variance components.In particular, it does not matter whether the estimate of s 2 B is positive or not.Next, we consider a split-plot experiment.With random effects of blocks, the following mixed-effects model can be used: where y ijk is the observation in the k-th subplot in the i-th main plot of the j-th block.In (12), μ is the intercept, α i is the fixed effect of the main-plot-treatment factor, b j is the random effect of the block factor, τ k is the fixed effect of the subplot factor, and (ατ) ik is the effect of interaction.The random effects in the model are assumed independently and normally distributed: With fixed effects of blocks, the model can be written: where β j are fixed block-effects, and all other terms are defined as in model (12).The null hypotheses of no difference between the effects of the main-plot treatments, H 0 : μ 1 .= μ 2 .= . . .= μ a. , where m i: ¼ P c k¼1 m ik =c can be tested in the models ( 12) and ( 13) using the exact F-test, the sums of squares for main-plot treatments and interaction, respectively.This test statistic is exactly F-distributed with a − 1 and (a − 1)(b − 1) degrees of freedom.Unlike (5), which is used by most software packages for mixed-effects models, (14) does not require any estimates of the variance components.Thus, it does not matter whether the estimates of s 2 B and s 2 AB are positive or not.

Examples
This section includes two subsections.The first of these illustrates differences between the three types of tests in an unbalanced two-factorial experiment.Specifically, the issue of the computation of the type 3 test is discussed.The model is a fixed-effects model with a single error variance.The estimate of this variance is always positive, so the issue of non-positive variance estimates is not investigated.The second subsection illustrates differences between methods for computing F-tests in balanced randomized complete block and split-plot experiments.Mixed-effects models are used.Differences between methods ( 5), ( 11) and ( 14) for computing the F-statistics are investigated in presence of non-positive estimates of variances components.

Unbalanced two-factorial experiment
Here we present an example using model (6).Weight gain was registered (Table 1) after subjecting each of fifteen animals to one of three different diets (B = Diets).We were specifically interested in the effect of sex (A = Sex), which is specified as the first factor in the model.The data was analyzed using all three types of tests (1-3), and the analysis was carried out using the glm procedure [2,29] in SAS and the lm, anova and Anova functions [1,24] in R. The anova function, with a small initial "a", was used for the type 1 test, whereas the Anova function of the car package [24], with capital "A", was applied for the types 2 and 3 tests.In addition, the ezANOVA function of the ez package [25] and the aov1, aov2, aov3 and GLM functions of the sasLMpackage [30] were investigated.The R script and the SAS program are provided as supporting information in S1 and S2 Files, respectively.
In R, parameters of factors are constrained through coding of factor levels.In this work, we have investigated three options for coding: contr.treatment,which is the default option, contr.sumand contr.SAS.With contr.treatment, the first level of the factor is the reference level, i.e., the parameter corresponding to that level is set to zero, whereas with contr.SAS, the last level is the reference level.With contr.sum,however, there is no reference level, but instead the parameters of the levels are sigma-restricted, i.e., their sum is constrained to zero [24].
Table 2 presents the results of the analyses, when using the lm, anova and Anova functions in R and the glm procedure in SAS.The three types of tests gave very different F-and p- values.Likewise in R, the results with respect to each contrast were highly dependent on the type of test.Regarding the results of types 1 and 2, all methods gave the same result.Since the dataset is unbalanced, the three types of tests are testing different hypotheses.Consequently, the three types were expected to give different F-and p-values.However, within each type of tests, there should be no differences between the methods.The observed differences between the methods for type 3 were not expected.
For type 1, the null hypothesis is and for type 2, the null hypothesis is For type 3, correct results were obtained only using the contr.sumparametrization in R or using SAS.The correct type 3 hypothesis is The test of sex presented by the Anova function as a type 3 test when using the contr.treatmentparametrization is a test of the null hypothesis H 0 : α 2 = 0.With this paramerization, α 1 = β 1 = γ 11 = γ 12 = γ 13 = γ 21 = 0.Because of these constraints, μ 11 = μ and μ 21 = μ + α 2 .Thus, the null hypothesis H 0 : α 2 = 0 is equivalent to In other words, this test of the factor sex is a test of no difference between males and females, provided they take the first diet.Specifically, it should be noted that this test is not a type 3 test, i.e., it does not test the null hypothesis (17).Since contr.treatment is the default parametrization and no warning is provided in the R console, it is likely that this conditional test is often mistaken for a type 3 test.Similarly, using the contr.SAS parametrization, the test of sex presented as type 3 does not test the type 3 hypothesis ( 17), but is a test of no difference between males and females when conditioned on the third diet.SAS does not put any constraints on the parameters but uses the Forward-Doolittle factorization [26,31] to determine the L that gives the correct type 3 test.It should be noted that both the glm procedure of SAS and the Anova function of the car package in R uses (3) for computation of the F-statistic, but these methods, glm and Anova, specify L differently.The ezANOVA function of R gave the same results as those presented for R in Table 2. Thus, the type 3 tests were incorrectly computed when contr.treatment and contr.SAS were used.In contrast, the aov1, aov2, aov3 and GLM functions produced the same results as the glm procedure of SAS.In this example, these functions performed types 1-3 tests correctly.

Randomized complete block and split-plot experiments
In the following examples, the mixed procedure [2,15] in SAS and the lmer function [32] in R were used for fitting models with both fixed and random effects, i.e., mixed-effects models (4).With both the mixed procedure and the lmer function, the variance components were estimated using the restricted maximum likelihood (REML) method.Using lmer, variance components are bounded to be non-negative.This is also the default setting of the mixed procedure.However, using the nobound option of the mixed procedure, negative estimates of the variance components are allowed.The analyses were carried out using version 9.4 of SAS and version 4.2.2 of R.
Three examples following the models ( 10), ( 12) and ( 13) were considered.In the randomized complete block model (10), we studied tests of treatments.In the split-plot models ( 12) and ( 13), we studied tests of main-plot treatments.Exact F-tests were calculated using (11) for model (10), and ( 14) for models (12) and (13).In both SAS and R, the F-values were computed as in (5), and the Kenward and Roger method [18,19], as implemented in the software, was used for computation of denominator degrees of freedom.Datasets, SAS programs and R scripts are included as supporting information.Results are summarized in Table 3.
In Example 1, a randomized complete block experiment with a = 2 treatment levels and b = 4 block levels was analyzed using model (10).Variance component estimates were ŝ2 B ¼ À 0:1063 and ŝ2 e ¼ 0:2996 using the mixed procedure with the nobound option.Without this option or using lmer, variance component estimates were ŝ2 B ¼ 0 and ŝ2 e ¼ 0:1932.Using the exact F-test (11), denominator degrees of freedom were calculated as (a − 1)(b − 1), which agrees with the results obtained using the nobound option of the mixed procedure.These methods also agree with regard to F-and p-values.Using the default setting of the mixed procedure, denominator degrees of freedom were calculated as a(b − 1), i.e., as if using a one-way ANOVA model.This computation gave a significant result (p = 0.026), while the computation using the nobound option did not (p = 0.100).With lmer, denominator degrees of freedom were the same as obtained with the nobound option of the mixed procedure.However, the F-value was the same as obtained with the mixed procedure without using the nobound option.Notice that with default settings, SAS and R gave different results.In the former, the effects of treatments were significant (p = 0.026), but in the latter they were not (p = 0.061).Neither software package provided the exact F-test value by default.
In Example 2, a split-plot experiment was analyzed using model (12).There were b = 2 blocks, a = 3 main plots per block and c = 12 subplots per main plot.Using the nobound option of the mixed procedure, the estimates of the variance components were ŝ2 B ¼ 15:5005, ŝ2 AB ¼ À 0:04346 and ŝ2 e ¼ 0:8775.Without this option, or using lmer, the estimates were ŝ2 B ¼ 15:487, ŝ2 AB ¼ 0 and ŝ2 e ¼ 0:848.The results of the exact F-test and the mixed procedure were the same when the nobound option was used.The denominator degrees of freedom obtained using lmer were equal or close to the correct number (a − 1)(b − 1) = 2.While the result using the nobound option of the mixed procedure was significant, (p = 0.042), it was highly significant (p = 0.001) without this option.Using the default settings of the mixed procedure, denominator degrees of freedom were computed as (ac − 1)(b − 1) = 35.The same F-value, 9.58, was obtained using the mixed procedure with default settings and using lmer.The results using the default settings of SAS and R were very different.In SAS, effects of main-plot treatments were highly significant (p = 0.001); however in R they were not (p = 0.094).
In Example 3, another dataset, with the same experimental layout as in Example 2, was analyzed, but using model (13).In this example, block effects were considered fixed.Unbounded variance component estimates (proc mixed, nobound) were ŝ2 AB ¼ À 0:0441 and ŝ2 e ¼ 1:0856, and bounded (proc mixed, default and lmer) were ŝ2 AB ¼ 0 and ŝ2 e ¼ 1:055.With fixed effects of blocks, results using the mixed procedure with the nobound option were the same as with the exact F-test.As in Example 2, the p-value differed between the default SAS and R.
The results of Table 3 were all produced using type 3 tests.In R, the contr.sumparametrization was used, and the tests were provided by the lmerTest package [33].Note that the differences between the methods in Table 3 are not due to type 3 test calculations, because the examples are balanced.The differences are explained by the fact that the exact F-test uses equations ( 11) and ( 14), while the other three methods all use equation ( 5) but with different variance estimates and degrees of freedom.
Neither the type of test nor the choice of parametrization should matter when the dataset is balanced.However, it was noted that with types 1 and 2 tests, the F-statistic was occasionally computed as 0 when contr.treatment or contr.SAS was used, whereby the p-value was correspondingly computed as 1.In SAS, the same phenomenon occurred when analyzing Example 3 using model (12).

Design
In the randomized complete block and split-plot examples, methods gave different results although datasets were balanced.Since those were just a few examples, a simulation study was performed, exploring a wider range of datasets.Specifically, the Type I error rates of the different options for hypothesis testing were estimated.
For each of the models, six layouts, i.e., combinations of a, b and c were investigated, as specified in the first columns of Tables 4 and 5, for model (10) and (12), respectively.In model (10), the error variance was s 2 e ¼ 1, but different values were adopted for the block variance: s 2 B ¼ 0:1; 0:3; 0:5; 0:7; 0:9.In model (12), the block and split-plot error variances were s 2 b ¼ 1 and s 2 e ¼ 1, respectively, while the main plot error variance varied: s 2 AB ¼ 0:1; 0:3; 0:5; 0:7; 0:9.All fixed parameters, i.e., μ, α i , τ k and (ατ) ik in models ( 10) and ( 12) were zero.For each case, i.e., row of Tables 4 and 5, 200 000 datasets were generated, and for each generated dataset, the null hypothesis H 0 : μ 1 = μ 2 = . . .= μ a in model (10) and H 0 : μ 1. = μ 2. = . . .= μ a. in model (12), was tested at significance level 0.05.With 200 000 simulations, an approximate 0.95 tolerance interval of ±1.96(0.05(1− 0.05)/200000) 1/2 = ±0.001was achieved.In addition, 10 000 datasets were generated for each case to estimate the probability of a non-positive estimate of s 2 B in (10) and s 2 AB in (12).Analyses were performed using the lmer and Anova functions of R and the mixed procedure of SAS.In R, type 2 tests were applied with contr.sumparametrization.Also in SAS, type 2 tests were used.The choice of type of test should not matter, as all designs were balanced.In SAS, the analyses were performed both with and without the nobound option.The Kenward and Roger method [18,19] for computation of denominator degrees of freedom was applied as implemented in the software.In addition to these likelihood-based procedures, the null hypotheses were tested using the exact F-tests ( 11) and ( 14), utilizing the anova procedure of SAS for computation of sums of squares.For the exact F-tests, denominator degrees of freedom were calculated as (a − 1)(b − 1).
For the specific layout {a = 3, b = 2, c = 12} in model ( 12) with parameter values s 2 e ¼ 1, s 2 b ¼ 1 and s 2 AB ¼ 0:1, an additional simulation study was performed, at which the null hypothesis H 0 : μ 1. = μ 2. = . . .= μ a. was tested using type 3 tests.In R, using the lmer function, the contr.sumparametrization was applied.In SAS, using the mixed procedure, tests were made both with and without the nobound option.Furthermore, the exact F-test (14) was included.This simulation study comprised 10 000 simulated datasets.The quantiles of the obtained p-values were plotted against the quantiles of the continuous uniform distribution U (0, 1).

Results
Table 4 shows the results of the simulation study using model (10).In R, with regard to all cases, frequency of Type I error was slightly too high.Especially when s 2 B was low, larger values than 0.05 were obtained, i.e., the null hypothesis was rejected somewhat too frequently.In those cases, the block variance, s 2 B , was more often estimated as zero than in other cases.In SAS, frequency of Type I error was close to the nominal level 0.05 in all cases, provided the nobound option was applied, except for the layout {a = 3, b = 2}.However, using the default setting of the mixed procedure, i.e., without using the nobound option, frequency of Type I error was consistently too large.A positive correlation was observed between frequency of Type I error and frequency of block variance estimated as zero.As expected, the exact method showed frequencies of Type I error close to the nominal level.Table 5 provides the results for model (12).In R, with layouts {a = 2, b = 4, c = 12} and {a = 3, b = 2, c = 12}, frequency of Type I error was much lower than the nominal level 0.05.In all examples, it was noticed that for lower values of s 2 AB , frequency of Type I error was below 0.05.Using the mixed procedure with the nobound option, frequency of Type I error was close to the nominal level 0.05 in all cases, except with layout {a = 3, b = 2, c = 12}, where Type I error rates were too low.Using the mixed procedure with default settings, frequency of Type I error was almost always higher than the nominal level 0.05.Very high frequencies were observed for the problematic layout {a = 3, b = 2, c = 12}.It is noteworthy that results obtained with R and the default setting of SAS were sometimes very different.With the layout {a = 3, b = 2, c = 12}, frequency of Type I error varied between 0.00 and 0.03 in R, and between 0.08 and 0.10 in SAS.This layout was characterized by relatively large probabilities for a non-positive estimate of the main-plot variance s 2 AB .The results of the additional simulation study are presented in Fig 1 .In these Q-Q plots, the performance of the methods can be studied for arbitrary levels of significance.As shown in Fig 1(A), the exact method ( 14) gave a uniform distribution of p-values, as expected.In Fig 1(B), the curve is above the reference line, indicating too many non-significant results when using the lmer function of R. With this method, the null hypothesis is rejected too rarely, due to the F-value being smaller than with the exact method (Table 3, Example 2).
The curve in Fig 1(C), for the mixed procedure of SAS using the default setting, is notably below the reference line at low levels of significance.Thus, when testing at significance levels 0.05 or 0.10, too many significant results are obtained using this method.This is a consequence of too many denominator degrees of freedom as compared to the exact method (Table 3, Example 2).For the mixed procedure using the nobound option, the curve of Fig 1(D) runs above the line.Although this is particularly so at large levels of significance, also at small levels e.g., 0.05, too few significant results are obtained (Table 5, design {a = 3, b = 2, c = 12}).

Practical advice
This article discussed two issues: i) type 3 tests, and ii) non-positive estimates of variance components.The first issue is a problem, since the Anova and ezANOVA functions of R sometimes perform type 3 tests incorrectly.The second issue is a problem too, since the actual significance level, i.e., the true probability of incorrectly rejecting the null hypothesis, may differ from the nominal significance level.Here, we give some practical advice on how to work around these problems in R and SAS when possible.
Correct type 3 tests are obtained in R using the Anova function of the car package if the line options(contrasts = c(''contr.sum'',''contr.poly'')) is included in the beginning of the R script.This code changes the contrasts globally.The code contrasts(A) <-contr.sum is useful if the change is to be made only for factor A. Note that the Anova function is spelled with an initial capital A and is different from the anova function.Just like the Anova function, the ezANOVA function is dependent on the choice of contrast, so the default contr.treatmentshould not be used for type 3 tests, but contr.summay be employed instead.Alternatively, the aov3 or the GLM function of the sasLM package may be preferred.These functions give correct type 3 tests even when contr.treatment is used.For mixedeffects models, correct type 3 tests are provided by the lmerTest package.These are obtained using the type argument of the anova function when applied to a model fitted by the lmer function.Note that the lmerTest package must be loaded before the model is fitted.No specific options are needed for correct type 3 tests using the glm and mixed procedures of SAS.
For mixed-effects models and balanced datasets, exact F-tests based on sums-of-squares computations, can be obtained using the test statement of the anova and glm procedures of SAS, where the user specifies the correct error terms.Similarly in R, exact F-tests can be computed using the aov function, which allows for a single error term in addition to the residual error term.For split-plot experiments, the sp.plot function of the agricolae package [34] is another option.
For analyses using the mixed procedure of SAS, we recommend allowing negative estimates of variance components when the main purpose is to draw conclusions about the fixed effects.This is made possible by the nobound option.Unfortunately, a similar option is not available for the lmer function.Although the tests are correctly implemented, the actual significance level deviates from the nominal one, as assessed in the Simulation section.Exceptional results such as discussed in the last paragraph of the Examples section might in R be avoided by using contr.sum.

Discussion
Using different statistical software, a broad variation in results can be observed, even though the same statistical model is assumed.Hence, the accuracy of the results renders the choice of the statistical method for data analysis essential.This article investigated the glm procedure in SAS and several functions for analysis of variance in R, as applied to an unbalanced dataset with two factors in a crossed design.Furthermore, the mixed procedure of SAS and the lmer function of the lme4 package in R were studied with focus on balanced randomized complete block and split-plot experiments.
In unbalanced fixed-effects experiments, exact F-tests can be calculated from three different types of sums of squares, known as types 1, 2, and 3.While in balanced data, these are identical, in unbalanced data they correspond to different hypotheses [4,9,27].The hypotheses of the type 3 tests do not depend on the number of observations in each treatment combination, and are therefore often preferred when analyzing unbalanced data.In addition to the choice of type of test, the user of R needs to select how the parameters of the factors should be constrained by coding of the factor levels, known as contrast functions.Here, three such contrasts were investigated; contr.treatment,contr.sumand contr.SAS [24].Correct type 3 tests of no differences between the marginal means are obtained with the glm procedure of SAS and the sasLM package in R. Correct type 3 tests are also obtained with the Anova and ezANOVA functions of R when using contr.sum[35].However, incorrect results are obtained in R with these functions when the default parametrization, contr.treatment, is applied.The tests obtained using this default contrast in R are not type 3 tests of marginal means, but tests of means conditioned on the first level of the other factor.
There is a misconception that type 3 tests are only correct if orthogonal coding is used [36].According to the help page for the Anova function of the car package, type 3 tests 'will normally only be sensible when using contrasts that, for different terms, are orthogonal in the row-basis of the model, such as those produced by contr.sum,contr.poly,or contr.helmert,but not by the default contr.treatment.'This statement is misleading.The type 3 test is sensible regardless of the contrast used, provided it is correctly computed.The type 3 test is equivalent to Yates's method of weighted squares of means [28], which does not result from zero-sum restrictions [4,6].In fact, the term 'type 3 test' origins from the SAS software [6], where orthogonal coding is not used [2].In R terminology, the type 3 test is not dependent on which contrast is employed.However, the so called 'type 3' test provided by the Anova function of the car package is dependent on the choice of contrasts.In this article, we have explained what hypotheses the Anova function is actually testing.
In linear mixed-effects modelling of experiments, estimation of variance components and computation of denominator degrees of freedom, associated with inference on the fixed effects, are important parts of the analysis.Using the lmer function of R, variance estimates are always bounded at zero.Similarly, the mixed procedure of SAS has a default lower boundary constraint of zero.With variance components estimated to zero, the denominator degrees of freedom of the F-statistics are calculated differently in R and SAS, despite both software employing the Kenward and Roger method.
In the simulation study, Type I error rate was investigated, having the variance components either constrained at zero or not, using the different options offered by the software.With the default setting of SAS, higher frequencies of Type I error were observed in SAS than in R, due to the fact that the number of degrees of freedom is calculated differently and is higher in SAS.Thus, significant results are obtained more often with SAS than with R. Using the nobound option of the mixed procedure, i.e., allowing negative estimates of variance components, the frequency of Type I error was close to the correct value.This method can be recommended since it controls Type I error [37].However, in our study, inaccurate results (F = 0) occasionally occurred when using the nobound option.For more complicated models than those investigated in this article, the default setting of the mixed procedure may be preferable for practical reasons, as problems with convergence may presumably arise more frequently with the nobound option.
Negative estimates of variance components, as allowed by the nobound option, may seem strange since variances cannot be negative.It is certainly challenging to explain the occurrence of negative estimates of variances.Furthermore, when any variance component is negative, it does not make sense to calculate the total variance and the components' percentages of this total variance, as may otherwise be done.It is an old question how to act when estimates of variances become negative [3,11].One recurring suggestion is to replace negative estimates with zero, but such procedure disturbs the properties of the estimates.In particular, the distribution of the F-statistic is affected so that it is no longer F-distributed.
Linear mixed-effects models are very useful for analysis of complicated datasets, but hypothesis tests are usually approximate.For statistical analysis of simple experiments, there are exact methods based on sums of squares calculations, which should be preferred.

Fig 1 .
Fig 1. Quantiles of observed p-values, testing H 0 : μ 1. = μ 2. = μ 3. versus quantiles of the uniform distribution U(0,1).Based on 10 000 simulated datasets using model (12) with layout {a = 3, b = 2, c = 12} and parameter values s 2 e ¼ 1, s 2 b ¼ 1 and s 2 AB ¼ 0:1.(A) Exact F-test.(B) R using the lmer and anova functions.(C) SAS using the mixed procedure with default settings.(D) SAS using the mixed procedure with the nobound option.https://doi.org/10.1371/journal.pone.0295066.g001 e ijk ; i ¼ 1; . . .; a; j ¼ 1; . . .; b; k ¼ 1; . . .; n ij ; ð6Þ where μ is an intercept, and α i and β j are unknown fixed effects for the factors A and B, respectively.The interaction of the i-th and j-th level of the two fixed factors A and B is represented by γ ij .The experimental errors, e ijk , are assumed independently distributed, e ijk � Nð0; s 2 e Þ.The data can be written in a table with a rows, b columns and n ij observations in the ij-th cell.Notice that this two-way table has no empty cells, i.e., n ij � 1, but the numbers of observations may vary between the cells.The total number of observations is n :: ¼